Implement tesseract backend #375

jonchang · 2024-11-07T19:59:23Z

Description

Implements the tesseract backend for OCR.

Testing notes

You can test this on the front-end by applying this patch:

diff --git a/OCR/ocr/api.py b/OCR/ocr/api.py
index 444c834..ec5603d 100644
--- a/OCR/ocr/api.py
+++ b/OCR/ocr/api.py
@@ -9,6 +9,7 @@ from fastapi import FastAPI, UploadFile, Form
 from fastapi.middleware.cors import CORSMiddleware
 
 from ocr.services.image_ocr import ImageOCR
+from ocr.services.tesseract_ocr import TesseractOCR
 from ocr.services.alignment import ImageAligner
 from ocr.services.image_segmenter import ImageSegmenter, segment_by_color_bounding_box
 
@@ -29,7 +30,7 @@ app.add_middleware(
 segmenter = ImageSegmenter(
     segmentation_function=segment_by_color_bounding_box,
 )
-ocr = ImageOCR()
+ocr = TesseractOCR()
 
 
 def data_uri_to_image(data_uri: str):
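
The one-line swap in this patch only works if `api.py` talks to both backends through the same method. The sketch below illustrates that assumption with hypothetical stand-in classes (`StubImageOCR` and `StubTesseractOCR` are not the project's classes) and a hypothetical `run_ocr` helper. The `image_to_text` shape follows the signature quoted in the review of `tesseract_ocr.py`; whether `ImageOCR` exposes exactly the same signature is assumed here, not shown in this PR.

```python
from typing import Protocol


class OCRBackend(Protocol):
    """Structural interface the API layer is assumed to rely on."""

    def image_to_text(self, segments: dict[str, object]) -> dict[str, tuple[str, float]]:
        ...


class StubImageOCR:
    """Hypothetical stand-in for the existing ImageOCR backend."""

    def image_to_text(self, segments):
        return {label: ("text from default backend", 0.0) for label in segments}


class StubTesseractOCR:
    """Hypothetical stand-in for the new TesseractOCR backend."""

    def image_to_text(self, segments):
        return {label: ("text from tesseract backend", 0.0) for label in segments}


def run_ocr(backend: OCRBackend, segments: dict[str, object]) -> dict[str, tuple[str, float]]:
    # The caller only touches image_to_text, so any conforming backend works.
    return backend.image_to_text(segments)


# A placeholder object stands in for an image segment (np.ndarray in the PR).
segments = {"patient_name": object()}
default_result = run_ocr(StubImageOCR(), segments)
tesseract_result = run_ocr(StubTesseractOCR(), segments)
```

Both calls return a dictionary keyed by the same segment labels, which is why replacing `ocr = ImageOCR()` with `ocr = TesseractOCR()` should be enough for a front-end test.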

Related Issues

#321

Checklist

The title of this PR is descriptive and concise.
My changes follow the style guidelines of this project.
I have added or updated test cases to cover my changes.
I've let the team know about this PR by linking it in the review channel.

jonchang · 2024-11-20T16:51:26Z

OCR/dev-dockerfile

@derekadombek these are the Dockerfile-related changes.

derekadombek · 2024-11-20

Oh, gotcha! That's about what I was imagining, and it makes sense. Like we chatted about earlier, it shouldn't make much of a difference in build time. Now that we're adding this, though, do you know whether we can eliminate other installed dependencies to make these images smaller, or will they still be needed?

I'm not sure whether we'll be able to get this in by January, but it would be nice to scan these images for CVEs.

jonchang · 2024-11-20

I'll be honest, I have no clue why ffmpeg and xlib are in there. I can look into it, though, if the image size is a problem. I also note that we don't clean up after apt update, which is also a concern.

arinkulshi-skylight

LGTM. Let's create a new ticket to call the function in the API and test the entire flow.

arinkulshi-skylight · 2024-11-20T21:21:48Z

OCR/ocr/services/tesseract_ocr.py

+        # Nothing matched, just return the default path
+        return tesserocr.get_languages()[0]
+
+    def image_to_text(self, segments: dict[str, np.ndarray]) -> dict[str, tuple[str, float]]:


TODO: initialize the class and invoke the function in the API call.
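
For reference, the signature above returns a `(text, confidence)` pair for each segment label. A minimal sketch of how such a method could be built on tesserocr's `PyTessBaseAPI` follows. This is not the PR's implementation: the function name `ocr_segments` and the `api_factory` parameter are hypothetical, and the segments are assumed to be PIL images already (the PR's signature takes `np.ndarray`, so a conversion step would come first).

```python
from typing import Any, Callable, Mapping, Optional


def ocr_segments(
    segments: Mapping[str, Any],
    api_factory: Optional[Callable[[], Any]] = None,
) -> dict[str, tuple[str, float]]:
    """Run OCR on each labelled segment and return (text, confidence) per label.

    Assumption: each value in `segments` is already a PIL image.
    """
    if api_factory is None:
        # Imported lazily so the function can be exercised without tesserocr installed.
        import tesserocr

        api_factory = tesserocr.PyTessBaseAPI

    results: dict[str, tuple[str, float]] = {}
    # One API instance is reused for all segments and released on exit.
    with api_factory() as api:
        for label, image in segments.items():
            api.SetImage(image)
            # Tesseract output usually ends with a newline, so strip it.
            text = api.GetUTF8Text().strip()
            # MeanTextConf reports an integer confidence from 0 to 100.
            confidence = float(api.MeanTextConf())
            results[label] = (text, confidence)
    return results
```

Passing `api_factory` explicitly makes it possible to substitute a fake API object in tests; in the API layer, the backend would be created once at module level and this kind of method called per request, as the TODO suggests.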

arinkulshi-skylight linked an issue Nov 12, 2024 that may be closed by this pull request

Implement Tesseract as an alternative model that can be used in backend #249

Closed

3 tasks

jonchang added 4 commits November 14, 2024 13:45

Initial tesserocr

e641bd8

drop pytesseract

1e3827f

Use actual raw API backend for confidence score

6a926dc

ensure PIL image is passed

11c48ca

jonchang force-pushed the tesseract-backend branch from 976f0d9 to 11c48ca Compare November 14, 2024 21:45

jonchang added 9 commits November 15, 2024 15:22

Guess at tessdata path

93ce87c

Install tesseract as part of docker setup

6586779

documentation

6aefc3e

lint check

c60af35

Use tesserocr api instead of pathlib shenanigans

0cb3182

Update docstring

b28dd12

Fix path detection crash

ea85a6d

Strip tesseract output

8fcf3f5

Update tests for tesseract comparisons

a31b4eb

jonchang marked this pull request as ready for review November 20, 2024 16:19

Update CI runs

a9c815b

jonchang commented Nov 20, 2024

jonchang mentioned this pull request Nov 20, 2024

Reduce the size of the OCR Docker image #412

Closed

arinkulshi-skylight approved these changes Nov 20, 2024

jonchang added this pull request to the merge queue Nov 20, 2024

Merged via the queue into main with commit 88ffe5b Nov 20, 2024
2 checks passed

jonchang deleted the tesseract-backend branch November 20, 2024 23:06

This was referenced Nov 20, 2024

Connect tesseract OCR to front end API #414

Closed

Implement tesseract for additional OCR backend #352

Closed
